# Multimodal video understanding
Qwen2.5 VL 32B Instruct GGUF
Apache-2.0
Qwen2.5-VL-32B-Instruct is a vision-language model with enhanced mathematical and problem-solving abilities, suited to multimodal tasks.
Image-to-Text English
unsloth
464
1
Xclip Large Patch14 Kinetics 600
MIT
X-CLIP extends CLIP to general video-language understanding and is trained on video-text pairs with contrastive learning.
Text-to-Video
Transformers English

microsoft
124
5
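
The X-CLIP entry above mentions contrastive learning on video-text pairs. As a rough illustration of what contrastive matching computes at inference time, the sketch below scores one video against a few candidate captions by cosine similarity followed by a softmax. It is a simplified toy, not X-CLIP's architecture: the mean pooling over frames, the `temperature=0.07` value, the function names, and the hand-written three-dimensional embeddings are all assumptions made for illustration.

```python
import math


def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding (a simplification)."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def video_text_probs(frames, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities between a video and each caption."""
    video = l2_normalize(mean_pool(frames))
    logits = [
        sum(a * b for a, b in zip(video, l2_normalize(text))) / temperature
        for text in text_embeddings
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy, hand-written embeddings (hypothetical; a real model would produce these).
frames = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]]
captions = {"playing guitar": [1.0, 0.0, 0.0], "swimming": [0.0, 1.0, 0.0]}

probs = video_text_probs(frames, list(captions.values()))
best = max(zip(captions, probs), key=lambda pair: pair[1])[0]
print(best)
```

In a real pipeline the frame and caption embeddings would come from the model's video and text encoders rather than being written by hand.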